Learning Robot Manipulation from Audio World Models
World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, when filling a bottle with water, visual information alone is ambiguous or incomplete, so the system must reason over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model that anticipates future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system on two manipulation tasks that require perceiving in-the-wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multimodal input, but critically on the accurate prediction of future audio states that embody intrinsic rhythmic patterns. Research in this domain has primarily concentrated on the following directions: 1) video-based models (Liang et al., 2025; Assran et al., 2025) that predict future visual frames from present observations, encoding the causal dependencies critical for physical interaction.
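The flow-matching idea behind this predictor can be sketched in a toy form. Everything here is an illustrative assumption, not the paper's architecture: latents are low-dimensional vectors, the "future" depends linearly on the present, and the learned velocity field is a plain linear map trained on the standard conditional flow-matching target (interpolate between noise and data, regress the velocity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 512

# Hypothetical setup: z_now encodes the current audio window,
# z_next the future window we want to anticipate.
W_true = 0.3 * rng.normal(size=(d, d))
z_now = rng.normal(size=(n, d))
z_next = z_now @ W_true.T

# Fixed evaluation set for the flow-matching objective.
x0_eval = rng.normal(size=(n, d))
t_eval = rng.uniform(size=(n, 1))

def fm_loss(A):
    xt = (1 - t_eval) * x0_eval + t_eval * z_next   # point on the interpolation path
    v_target = z_next - x0_eval                     # conditional flow-matching target
    feats = np.concatenate([xt, z_now, t_eval], axis=1)
    return np.mean((feats @ A.T - v_target) ** 2)

A = np.zeros((d, 2 * d + 1))   # linear stand-in for the learned velocity field
loss_before = fm_loss(A)
for _ in range(2000):
    i = rng.integers(0, n, size=64)
    x1, x0 = z_next[i], rng.normal(size=(64, d))
    t = rng.uniform(size=(64, 1))
    xt = (1 - t) * x0 + t * x1
    feats = np.concatenate([xt, z_now[i], t], axis=1)
    grad = (feats @ A.T - (x1 - x0)).T @ feats / 64   # MSE gradient step
    A -= 0.05 * grad
loss_after = fm_loss(A)
```

At inference one would integrate dx/dt = v(x, t, z_now) from noise at t=0 to a predicted future latent at t=1; the sketch stops at showing that the velocity-regression objective is learnable.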
Invariance Co-training for Robot Visual Generalization
Yang, Jonathan, Finn, Chelsea, Sadigh, Dorsa
Abstract-- Reasoning from diverse observations is a fundamental capability for generalist robot policies to operate in a wide range of environments. Despite recent advancements, many large-scale robotic policies remain sensitive to key sources of observational variation--such as changes in camera perspective, lighting, and the presence of distractor objects. We posit that the limited generalizability of these models arises from the substantial diversity required to robustly cover these quasistatic axes, coupled with the current scarcity of large-scale robotic datasets that exhibit rich variation across them. In this work, we propose to systematically examine what robots need to generalize across these challenging axes by introducing two key auxiliary tasks--state similarity and invariance to observational perturbations--applied to both demonstration data and static visual data. We then show that via these auxiliary tasks, leveraging both more-expensive robotic demonstration data and less-expensive, visually rich synthetic images generated from non-physics-based simulation (e.g., Unreal Engine) can lead to substantial increases in generalization to unseen camera viewpoints, lighting configurations, and distractor conditions. Our results demonstrate that co-training on this diverse data improves performance by 18% over existing generative augmentation methods. Robotic foundation models have shown impressive progress in generalizing to everyday scenarios by leveraging large-scale datasets spanning multiple embodiments, environments, and tasks [1], [2]. However, despite their breadth, the resulting models often remain brittle in real-world settings--failing to handle unseen spatial configurations of objects or adapt to drastic visual changes such as lighting and viewpoint shifts. We hypothesize that the brittleness of current robotic policies stems from insufficient coverage of key observational factors during training.
For example, many large-scale datasets provide only one or two third-person perspectives per scene, limiting robustness to viewpoint shifts.
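The two auxiliary objectives can be illustrated with toy stand-ins. The linear "encoder", the brightness-jitter perturbation, and the margin formulation of state similarity are all assumptions for the sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_obs, d_emb, n = 16, 4, 256

# Illustrative stand-ins: a linear "encoder" and crude perturbations that
# play the role of lighting / viewpoint / distractor changes.
W = rng.normal(size=(d_emb, d_obs)) / np.sqrt(d_obs)
obs = rng.normal(size=(n, d_obs))   # pretend these are frames along a trajectory

def perturb(x):
    return x * rng.uniform(0.8, 1.2) + 0.05 * rng.normal(size=x.shape)

def encode(x):
    return x @ W.T

# Auxiliary task 1: invariance -- embeddings of two perturbed views of the
# same state should match.
z_a, z_b = encode(perturb(obs)), encode(perturb(obs))
invariance_loss = np.mean((z_a - z_b) ** 2)

# Auxiliary task 2: state similarity -- states close in time should embed
# closer than distant ones (one possible margin formulation).
z = encode(obs)
near = np.mean(np.sum((z[:-1] - z[1:]) ** 2, axis=1))    # consecutive states
far = np.mean(np.sum((z[:-10] - z[10:]) ** 2, axis=1))   # 10 steps apart
similarity_loss = np.maximum(0.0, near - far + 1.0)
```

In co-training, losses of this shape would be added to the imitation objective on both robot demonstrations and the cheaper synthetic images.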
Teaching robot policies without new demonstrations: interview with Jiahui Zhang and Jesse Zhang
In their paper ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, presented at CoRL 2025, the authors introduce a framework for learning robot manipulation tasks solely from language instructions, without per-task demonstrations. We asked Jiahui Zhang and Jesse Zhang to tell us more. What is the topic of the research in your paper, and what problem were you aiming to solve? Our research addresses the problem of enabling robot manipulation policies to solve novel, language-conditioned tasks without collecting new demonstrations for each task. We begin with a small set of demonstrations in the deployment environment, train a language-conditioned reward model on them, and then use that learned reward function to fine-tune the policy on unseen tasks, with no additional demonstrations required.
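The core loop, fine-tuning a policy from a learned reward alone, can be sketched in one dimension. The quadratic "reward model" and the Gaussian policy updated by REINFORCE are hypothetical stand-ins, not ReWiND's actual components:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in: a "learned reward model" scores how well a 1-D action matches
# a language instruction, here reduced to a target value.
def learned_reward(action, target):
    return -(action - target) ** 2

# Gaussian policy with a learnable mean, fine-tuned by REINFORCE on the
# learned reward -- no new demonstrations are used.
mu, sigma, lr, target = 0.0, 0.5, 0.05, 1.0
for _ in range(500):
    a = mu + sigma * rng.normal(size=64)          # sample a batch of actions
    r = learned_reward(a, target)
    r = r - r.mean()                              # baseline for variance reduction
    mu += lr * np.mean(r * (a - mu) / sigma**2)   # score-function gradient step
```

The policy mean drifts toward the instruction's target purely by following the reward model's gradient signal, which is the mechanism that lets the method skip per-task demonstrations.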
RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies
Atreya, Pranav, Pertsch, Karl, Lee, Tony, Kim, Moo Jin, Jain, Arhan, Kuramshin, Artur, Eppner, Clemens, Neary, Cyrus, Hu, Edward, Ramos, Fabio, Tremblay, Jonathan, Arora, Kanav, Ellis, Kirsty, Macesanu, Luca, Villasevil, Marcel Torne, Leonard, Matthew, Cho, Meedeum, Aslan, Ozgur, Dass, Shivin, Wang, Jie, Reger, William, Yuan, Xingfang, Yang, Xuning, Gupta, Abhishek, Jayaraman, Dinesh, Berseth, Glen, Daniilidis, Kostas, Martin-Martin, Roberto, Lee, Youngwoon, Liang, Percy, Finn, Chelsea, Levine, Sergey
Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
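Deriving a ranking from pairwise preference feedback is typically done with a Bradley-Terry-style model; the sketch below assumes that formulation (the paper's exact aggregation may differ) and fits scores to simulated double-blind A/B outcomes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pairwise results: four policies with latent "true" strengths;
# each record (i, j, w) says policy i beat j (w=True) in a double-blind trial.
true_skill = np.array([2.0, 1.0, 0.5, 0.0])
pairs = []
for _ in range(600):
    i, j = rng.choice(4, size=2, replace=False)
    p = 1 / (1 + np.exp(-(true_skill[i] - true_skill[j])))
    pairs.append((i, j, rng.random() < p))

# Bradley-Terry scores via gradient ascent on the pairwise log-likelihood.
s = np.zeros(4)
for _ in range(300):
    g = np.zeros(4)
    for i, j, w in pairs:
        p = 1 / (1 + np.exp(-(s[i] - s[j])))
        g[i] += w - p
        g[j] -= w - p
    s += 0.01 * g
    s -= s.mean()   # scores are identifiable only up to a constant

ranking = np.argsort(-s)
```

Because each evaluator only supplies relative judgments, tasks and environments never need to be standardized across sites; only the pairwise outcomes are pooled.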
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
Jangir, Yash, Zhang, Yidi, Yamazaki, Kashu, Zhang, Chenyu, Tu, Kuan-Hsun, Ke, Tsung-Wei, Ke, Lei, Bisk, Yonatan, Fragkiadaki, Katerina
The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
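The "systematic perturbation" protocol can be illustrated with a toy sweep. The perturbation axes, the policy, and the degradation model below are all invented for the sketch; the point is the structure of a controlled stress test over a grid of variations:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(9)

# Toy stress test: evaluate a "policy" under a grid of controlled
# perturbations (texture / placement stand-ins) and record success rates.
def toy_policy_success(texture_shift, placement_shift):
    # hypothetical: success probability degrades as perturbations grow
    p = 0.95 - 0.3 * texture_shift - 0.5 * placement_shift
    return rng.random() < p

axes = {"texture": [0.0, 0.5, 1.0], "placement": [0.0, 0.4, 0.8]}
results = {}
for tex, pl in product(axes["texture"], axes["placement"]):
    trials = [toy_policy_success(tex, pl) for _ in range(200)]
    results[(tex, pl)] = np.mean(trials)

nominal = results[(0.0, 0.0)]     # unperturbed digital twin
hardest = results[(1.0, 0.8)]     # strongest perturbation on both axes
```

Comparing the nominal cell against perturbed cells is what turns a single success number into a robustness profile along each axis.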
MemER: Scaling Up Memory for Robot Control via Experience Retrieval
Sridhar, Ajay, Pan, Jennifer, Sharma, Satvik, Finn, Chelsea
Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://jen-pan.github.io/memer/. In recent times, we have seen significant strides in the language-following and generalization capabilities of robotic manipulation policies (Brohan et al., 2023; Intelligence et al., 2025; Kim et al., 2024; NVIDIA et al., 2025; Team et al., 2025). While these policies are improving for real-world deployment, a critical limitation remains: the absence of long-term memory. Memory allows humans to handle the inherent partial observability found in their environment. For instance, if a person wanted to make a sandwich, they would have to recall where they saw the jar of peanut butter or the knife, especially if these items had not been recently viewed. The ability to form and retrieve memories is a crucial step towards robots solving complex, multi-step tasks.
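The retrieval step at the heart of this design can be sketched with embedding similarity. The per-frame embeddings, the cosine top-k rule, and the fixed recent-frame window are illustrative assumptions, not MemER's trained keyframe selector:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch: pick the k most task-relevant keyframes from a long observation
# history, then condition the low-level policy on [keyframes + recent frames]
# instead of the full history.
T, d, k, recent = 500, 32, 3, 5
history = rng.normal(size=(T, d))                 # per-frame embeddings (stand-in)
query = history[42] + 0.01 * rng.normal(size=d)   # e.g., "where was the peanut butter?"

def top_k_keyframes(history, query, k):
    h = history / np.linalg.norm(history, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(h @ q))[:k]   # indices of the most similar frames

keyframes = top_k_keyframes(history, query, k)
context = np.concatenate([history[keyframes], history[-recent:]])
```

The compact `context` (8 frames here instead of 500) is what keeps inference cost flat as the episode grows to minutes of history.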
WorldGym: World Model as An Environment for Policy Evaluation
Quevedo, Julian, Sharma, Ansh Kumar, Sun, Yixiang, Suryavanshi, Varad, Liang, Percy, Yang, Sherry
Robots can help humans in many ways, for example through home robots performing chores (Shafiullah et al.). With the development of generative models trained on large-scale video data (Ho et al., 2022; Villegas et al., 2022; Singer et al., 2022), recent work has shown that video world models can visually emulate real-world environments. Inspired by this observation, we propose a world-model-based policy evaluation environment (WorldGym), as shown in Figure 1. To enable efficient rollouts of policies that predict action chunks of different lengths, WorldGym flexibly aligns its diffusion horizon length with the policies' chunk sizes at inference time. Given video rollouts from the world model, WorldGym then uses a vision-language model (VLM) to determine task success. We then evaluate VLA-based robot policies by rolling them out in the world model starting from real initial frames, and compare their success rates (policy values) in WorldGym to those achieved in real-world experiments. We formalize the setting as a multi-task, finite-horizon, partially observable Markov decision process (POMDP) (Puterman, 2014; Kaelbling et al., 1995). In this section, we first describe our implementation of world model training and inference. See videos and code at https://world-model-eval.github.io
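The evaluation loop, policy chunks rolled through a world model whose horizon matches the chunk size, with a judge scoring the final frames, can be sketched with scalar stand-ins. The 1-D "world model", the hand-written policy, and the threshold "success detector" (playing the VLM's role) are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def policy(state, chunk=4):
    # toy policy: emit a chunk of actions that drives the state toward 1.0
    step = np.clip(1.0 - state, -0.8, 0.8) / chunk
    return np.full(chunk, step)

def world_model(state, actions):
    # horizon = len(actions) = the policy's chunk size, as in the alignment idea
    states = []
    for a in actions:
        state = state + a + 0.01 * rng.normal()   # noisy learned dynamics stand-in
        states.append(state)
    return states

def success(states, goal=1.0, tol=0.1):
    return abs(states[-1] - goal) < tol   # stand-in for the VLM success judgment

state, frames = 0.0, []
for _ in range(10):          # 10 chunks of 4 actions each
    chunk = policy(state)
    frames += world_model(state, chunk)
    state = frames[-1]

succeeded = success(frames)
```

Starting each rollout from a real initial frame and comparing `succeeded` rates against real-robot outcomes is what lets the world model stand in for physical evaluation.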
SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration
Jin, Yang, Lv, Jun, Xue, Han, Chen, Wendi, Wen, Chuan, Lu, Cewu
Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement.
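The contrast between on-manifold and raw action-space noise can be made concrete with PCA as a stand-in for the learned latent model (SOE's actual representation is learned, not PCA; the 2-D plane and noise scales here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(6)

# Demonstrated actions lie near a low-dimensional manifold: here a 2-D plane
# inside an 8-D action space, recovered by PCA as a stand-in for the latent model.
d, k, n = 8, 2, 400
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
actions = rng.normal(size=(n, k)) @ basis.T + 0.01 * rng.normal(size=(n, d))

mean = actions.mean(axis=0)
U = np.linalg.svd(actions - mean, full_matrices=False)[2][:k]  # principal axes

def on_manifold_explore(a, scale=0.3):
    z = (a - mean) @ U.T                 # encode action into the latent space
    z = z + scale * rng.normal(size=k)   # perturb only along valid variation
    return mean + z @ U                  # decode back to the action space

def off_manifold_distance(a):
    recon = mean + (a - mean) @ U.T @ U
    return np.linalg.norm(a - recon)

a0 = actions[0]
safe = off_manifold_distance(on_manifold_explore(a0))       # stays on the plane
naive = off_manifold_distance(a0 + 0.3 * rng.normal(size=d))  # raw action noise
```

Latent-space perturbations land back on the demonstrated-action manifold by construction, while equally sized raw noise mostly leaves it, which is the safety argument in miniature.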
Fine-Tuning Robot Policies While Maintaining User Privacy
Christie, Benjamin A., Parekh, Sagar, Losey, Dylan P.
Recent works introduce general-purpose robot policies. These policies provide a strong prior over how robots should behave -- e.g., how a robot arm should manipulate food items. But in order for robots to match an individual person's needs, users typically fine-tune these generalized policies -- e.g., showing the robot arm how to make their own preferred dinners. Importantly, during the process of personalizing robots, end-users leak data about their preferences, habits, and styles (e.g., the foods they prefer to eat). Other agents can simply roll-out the fine-tuned policy and see these personally-trained behaviors. This leads to a fundamental challenge: how can we develop robots that personalize actions while keeping learning private from external agents? We here explore this emerging topic in human-robot interaction and develop PRoP, a model-agnostic framework for personalized and private robot policies. Our core idea is to equip each user with a unique key; this key is then used to mathematically transform the weights of the robot's network. With the correct key, the robot's policy switches to match that user's preferences -- but with incorrect keys, the robot reverts to its baseline behaviors. We show the general applicability of our method across multiple model types in imitation learning, reinforcement learning, and classification tasks. PRoP is practically advantageous because it retains the architecture and behaviors of the original policy, and experimentally outperforms existing encoder-based approaches. See videos and code here: https://prop-icra26.github.io.
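One way to picture a key-based weight transform is below. This is an illustrative scheme, not PRoP's actual transform: the personalization delta is stored permuted under a secret key, and note that here a wrong key yields a scrambled (non-personal) delta rather than the paper's clean fallback to baseline behavior:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: a baseline policy weight matrix and a user's
# fine-tuned personalization delta.
W_base = rng.normal(size=(4, 4))
delta = 0.5 * rng.normal(size=(4, 4))

def perm(key, n=16):
    # the key deterministically seeds a permutation of the flattened delta
    return np.random.default_rng(key).permutation(n)

key = 12345
stored = delta.ravel()[perm(key)]   # scrambled delta kept on the robot

def policy_weights(candidate_key):
    unscrambled = np.empty(16)
    unscrambled[perm(candidate_key)] = stored   # invert the permutation
    return W_base + unscrambled.reshape(4, 4)

right = policy_weights(12345)   # correct key -> personalized weights
wrong = policy_weights(99999)   # wrong key -> garbled, non-personal weights
```

The appeal of weight-level transforms is that the architecture is untouched, so the scheme stays model-agnostic across imitation, RL, and classification policies.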
MotionTrans: Human VR Data Enable Motion-Level Learning for Robotic Manipulation Policies
Yuan, Chengbo, Zhou, Rui, Liu, Mengzhen, Hu, Yingdong, Wang, Shengjie, Yi, Li, Wen, Chuan, Zhang, Shanghang, Gao, Yang
Scaling real robot data is a key bottleneck in imitation learning, leading to the use of auxiliary data for policy training. While other aspects of robotic manipulation such as image or language understanding may be learned from internet-based datasets, acquiring motion knowledge remains challenging. Human data, with its rich diversity of manipulation behaviors, offers a valuable resource for this purpose. While previous works show that using human data can bring benefits, such as improving robustness and training efficiency, it remains unclear whether it can realize its greatest advantage: enabling robot policies to directly learn new motions for task completion. In this paper, we systematically explore this potential through multi-task human-robot cotraining. We introduce MotionTrans, a framework that includes a data collection system, a human data transformation pipeline, and a weighted cotraining strategy. By cotraining 30 human-robot tasks simultaneously, we directly transfer motions of 13 tasks from human data to deployable end-to-end robot policies. Notably, 9 tasks achieve non-trivial success rates in a zero-shot manner. MotionTrans also significantly enhances pretraining-finetuning performance (+40% success rate). Through ablation studies, we also identify key factors for successful motion learning: cotraining with robot data and broad task-related motion coverage. These findings unlock the potential of motion-level learning from human data, offering insights into its effective use for training robotic manipulation policies. All data, code, and model weights are open-sourced at https://motiontrans.github.io/.
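The weighted cotraining strategy reduces, at its simplest, to a sampling ratio between the two data sources. The weight value and dataset sizes below are invented for the sketch (the paper tunes this trade-off):

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative pools: a small robot dataset and a larger pool of
# (already transformed) human VR demonstrations.
robot_data = [("robot", i) for i in range(200)]
human_data = [("human", i) for i in range(800)]
w_robot = 0.5   # hypothetical sampling weight for robot data

def sample_batch(size=32):
    batch = []
    for _ in range(size):
        pool = robot_data if rng.random() < w_robot else human_data
        batch.append(pool[rng.integers(len(pool))])
    return batch

batches = [sample_batch() for _ in range(100)]
frac_robot = np.mean([sum(x[0] == "robot" for x in b) / 32 for b in batches])
```

Weighting at the batch level rather than concatenating the datasets keeps the scarce robot data from being drowned out by the much larger human pool, which matches the ablation finding that cotraining with robot data is a key factor.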